Monitoring Infrastructure
Overview
In this project, I worked on monitoring applications and infrastructure
using AWS services. The ability to monitor applications and
infrastructure is critical for delivering reliable, consistent IT
services.
Monitoring requirements range from collecting statistics for long-term
analysis to quickly reacting to changes and outages. Monitoring can also
support compliance reporting by continuously checking that
infrastructure is meeting organizational standards.
I learned how to use several AWS monitoring tools:
- Amazon CloudWatch Metrics
- Amazon CloudWatch Logs
- Amazon CloudWatch Events
- AWS Config
By the end, I successfully:
- Used AWS Systems Manager Run Command to install the CloudWatch agent
  on Amazon Elastic Compute Cloud (Amazon EC2) instances
- Monitored application logs using the CloudWatch agent and CloudWatch
  Logs
- Monitored system metrics using the CloudWatch agent and CloudWatch
  Metrics
- Created real-time notifications using CloudWatch Events
- Tracked infrastructure compliance using AWS Config
Task 1: Installing the CloudWatch Agent
I started by installing the CloudWatch agent on an EC2 instance. The
CloudWatch agent is versatile: it can collect metrics and logs from both
EC2 instances and on-premises servers, including:
- System-level metrics from EC2 instances (CPU allocation, free disk
  space, and memory utilization). These metrics are collected from
  inside the machine itself and complement the standard metrics that
  CloudWatch already collects.
- System-level metrics from on-premises servers, which enables
  monitoring of hybrid environments and servers not managed by AWS.
- System and application logs from both Linux and Windows servers.
- Custom metrics from applications and services using the StatsD and
  collectd protocols.
Here's how I did it:
- First, I opened the AWS Management Console and selected Systems
  Manager from the Services menu.
- In the navigation pane, I chose Run Command. I had to click the icon
  in the top-left corner to make the navigation pane visible.
- I clicked "Run a Command" and selected the AWS-ConfigureAWSPackage
  option (which typically appears toward the top of the list).
- Under Command parameters, I set:
    Action: Install
    Name: AmazonCloudWatchAgent
    Version: latest
- For Targets, I chose "Choose instances manually" and selected the Web
  Server instance.
- I clicked Run and waited for the Overall status to change to Success,
  occasionally refreshing the page using the refresh button toward the
  top of the page.
- To verify success, I viewed the output by clicking the expand icon
  next to the instance under Targets and outputs, then clicked View
  output.
- I expanded Step 1 - Output and saw "Successfully installed
  arn:aws:ssm:::package/AmazonCloudWatchAgent."
I also noticed the message "Step execution skipped due to unsatisfied
preconditions: '"StringEquals": [platformType, Windows]'. Step name:
createDownloadFolder". That step applies only to Windows platforms, and
my instance was created from a Linux AMI, so the skip was expected and I
safely ignored it. Where Step 1 - Output shows only this message, I
could check Step 2 - Output instead.
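The console steps above can also be scripted. Here is a minimal sketch, assuming the boto3 SDK and working AWS credentials. The function names are my own, and the parameter keys (action, name, version) mirror the console fields as I understand the AWS-ConfigureAWSPackage document, so they are worth checking against the document's Content tab.

```python
def build_install_agent_request(instance_ids):
    """Return keyword arguments for the Systems Manager SendCommand call."""
    return {
        "DocumentName": "AWS-ConfigureAWSPackage",
        "InstanceIds": list(instance_ids),
        "Parameters": {
            # Run Command parameters are passed as lists of strings.
            "action": ["Install"],
            "name": ["AmazonCloudWatchAgent"],
            "version": ["latest"],
        },
    }


def install_agent(instance_ids):
    """Send the command (needs boto3, credentials, and a default region)."""
    import boto3  # imported lazily so the builder works without boto3

    ssm = boto3.client("ssm")
    response = ssm.send_command(**build_install_agent_request(instance_ids))
    return response["Command"]["CommandId"]
```

Calling install_agent(["i-0123456789abcdef0"]) with a real instance ID (the one shown is a placeholder) would return a command ID that can be looked up under Run Command in the console.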
Next, I needed to configure the CloudWatch agent to collect web server
logs and system metrics. I stored this configuration in AWS Systems
Manager Parameter Store:
- In the navigation pane, I selected Parameter Store.
- I clicked Create parameter and configured:
    Name: Monitor-Web-Server
    Description: Collect web logs and system metrics
    Value: the following JSON configuration:

    {
      "logs": {
        "logs_collected": {
          "files": {
            "collect_list": [
              {
                "log_group_name": "HttpAccessLog",
                "file_path": "/var/log/httpd/access_log",
                "log_stream_name": "{instance_id}",
                "timestamp_format": "%b %d %H:%M:%S"
              },
              {
                "log_group_name": "HttpErrorLog",
                "file_path": "/var/log/httpd/error_log",
                "log_stream_name": "{instance_id}",
                "timestamp_format": "%b %d %H:%M:%S"
              }
            ]
          }
        }
      },
      "metrics": {
        "metrics_collected": {
          "cpu": {
            "measurement": [
              "cpu_usage_idle",
              "cpu_usage_iowait",
              "cpu_usage_user",
              "cpu_usage_system"
            ],
            "metrics_collection_interval": 10,
            "totalcpu": false
          },
          "disk": {
            "measurement": [
              "used_percent",
              "inodes_free"
            ],
            "metrics_collection_interval": 10,
            "resources": [
              "*"
            ]
          },
          "diskio": {
            "measurement": [
              "io_time"
            ],
            "metrics_collection_interval": 10,
            "resources": [
              "*"
            ]
          },
          "mem": {
            "measurement": [
              "mem_used_percent"
            ],
            "metrics_collection_interval": 10
          },
          "swap": {
            "measurement": [
              "swap_used_percent"
            ],
            "metrics_collection_interval": 10
          }
        }
      }
    }
I examined the configuration and found it defined the following items to
be monitored:
- Logs: Two web server log files (the access log and the error log) to
  be collected and sent to CloudWatch Logs
- Metrics: CPU, disk, disk I/O, memory, and swap metrics to be
  collected every 10 seconds and sent to CloudWatch Metrics
I clicked Create parameter to store this parameter for reference when
starting the CloudWatch agent.
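The same configuration can be generated and stored from code rather than pasted by hand. This is a minimal sketch, assuming boto3 is available. The helper names are hypothetical, and the parameter type "String" is my assumption, since the steps above don't state which type was chosen.

```python
import json

COLLECTION_INTERVAL = 10  # seconds, matching the configuration above


def log_entry(group, path):
    """One entry of the agent's log collect_list."""
    return {
        "log_group_name": group,
        "file_path": path,
        "log_stream_name": "{instance_id}",
        "timestamp_format": "%b %d %H:%M:%S",
    }


def metric(measurements, **extra):
    """One section of metrics_collected with the shared interval."""
    section = {
        "measurement": measurements,
        "metrics_collection_interval": COLLECTION_INTERVAL,
    }
    section.update(extra)
    return section


def build_agent_config():
    """Rebuild the CloudWatch agent configuration as a Python dict."""
    cpu = ["cpu_usage_idle", "cpu_usage_iowait",
           "cpu_usage_user", "cpu_usage_system"]
    return {
        "logs": {"logs_collected": {"files": {"collect_list": [
            log_entry("HttpAccessLog", "/var/log/httpd/access_log"),
            log_entry("HttpErrorLog", "/var/log/httpd/error_log"),
        ]}}},
        "metrics": {"metrics_collected": {
            "cpu": metric(cpu, totalcpu=False),
            "disk": metric(["used_percent", "inodes_free"], resources=["*"]),
            "diskio": metric(["io_time"], resources=["*"]),
            "mem": metric(["mem_used_percent"]),
            "swap": metric(["swap_used_percent"]),
        }},
    }


def build_put_parameter_request():
    """Keyword arguments for the Systems Manager PutParameter call."""
    return {
        "Name": "Monitor-Web-Server",
        "Description": "Collect web logs and system metrics",
        "Value": json.dumps(build_agent_config()),
        "Type": "String",  # assumption: a plain (unencrypted) parameter
    }


def store_config():
    """Create the parameter (needs boto3, credentials, and a region)."""
    import boto3

    boto3.client("ssm").put_parameter(**build_put_parameter_request())
```

Building the JSON from a dict has one practical benefit: json.dumps always emits valid JSON, so a stray comma can't slip into the stored value.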
After creating the parameter, I started the CloudWatch agent on the web
server:
- I went back to Run Command and clicked "Run command".
- I filtered for "AmazonCloudWatch-ManageAgent" by:
    - Selecting Document name prefix
    - Selecting Equals
    - Entering AmazonCloudWatch-ManageAgent
    - Verifying the filter was Document name prefix : Equals :
      AmazonCloudWatch-ManageAgent
    - Pressing Enter
- Before running it, I viewed the command definition by clicking on
  AmazonCloudWatch-ManageAgent (the name itself).
- A new browser tab opened showing the definition of the command. I
  browsed through the content of each tab to see how a command document
  is defined.
- I checked the Content tab and scrolled to the bottom to see the actual
  script that would run on the target instance. The script referenced
  the AWS Systems Manager Parameter Store to retrieve the CloudWatch
  agent configuration that I defined earlier.
- After closing that tab, I selected AmazonCloudWatch-ManageAgent and
  configured:
    Action: configure
    Mode: ec2
    Optional Configuration Source: ssm
    Optional Configuration Location: Monitor-Web-Server
    Optional Restart: yes
- For Targets, I chose "Choose instances manually" and selected the Web
  Server.
- I clicked Run and waited for Success status, occasionally refreshing
  the page.
At this point, the CloudWatch agent was running and sending log and
metric data to CloudWatch.
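Starting the agent can be scripted the same way, including the wait for a final status that I did by refreshing the page. The sketch below assumes boto3. The function names are hypothetical, and the camelCase parameter keys are my reading of the AmazonCloudWatch-ManageAgent document, so they should be confirmed in its Content tab.

```python
def build_manage_agent_request(instance_ids, parameter_name="Monitor-Web-Server"):
    """Keyword arguments for SendCommand that configure and restart the agent."""
    return {
        "DocumentName": "AmazonCloudWatch-ManageAgent",
        "InstanceIds": list(instance_ids),
        "Parameters": {
            "action": ["configure"],
            "mode": ["ec2"],
            "optionalConfigurationSource": ["ssm"],
            "optionalConfigurationLocation": [parameter_name],
            "optionalRestart": ["yes"],
        },
    }


def start_agent(instance_id, poll_seconds=5):
    """Run the command and poll until it leaves the in-flight states."""
    import time

    import boto3  # imported lazily so the builder works without boto3

    ssm = boto3.client("ssm")
    request = build_manage_agent_request([instance_id])
    command_id = ssm.send_command(**request)["Command"]["CommandId"]
    in_flight = ("Pending", "InProgress", "Delayed", "Cancelling")
    while True:
        # Sleep first: the invocation is not queryable immediately.
        time.sleep(poll_seconds)
        status = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )["Status"]
        if status not in in_flight:
            return status  # for example "Success" or "Failed"
```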
Task 2: Monitoring Application Logs Using CloudWatch Logs
CloudWatch Logs lets me monitor applications and systems using log data.
For example, CloudWatch Logs can track the number of errors that occur
in application logs and send a notification whenever the rate of errors
exceeds a threshold that I specify.
CloudWatch Logs uses existing log data for monitoring, so no code
changes are required. For example, I can monitor application logs for
specific literal terms (such as "NullReferenceException") or count the
number of occurrences of a literal term at a particular position in log
data (such as 404 status codes in a web server access log). When the
term being searched for is found, CloudWatch Logs reports the data to a
CloudWatch metric that I specify. Log data is encrypted while in transit
and while it is at rest.
The Web Server generates two types of log data: access logs and error
logs.
I generated some log data on the Web Server to monitor with CloudWatch
Logs:
- I clicked the Details dropdown menu above the instructions and clicked
  Show.
- I copied the WebServerIP value and opened it in a new browser tab,
  which showed a web server Test Page.
- To generate log data, I appended "/start" to the URL, which generated
  a 404 error since the page doesn't exist. This was intentional to
  generate data in the access logs.
- I kept this tab open but returned to the AWS Management Console.
- From the Services menu, I opened CloudWatch.
- In the navigation pane, I chose Log groups and saw two logs:
  HttpAccessLog and HttpErrorLog. (If these logs weren't listed, I
  waited a minute and clicked Refresh.)
- I clicked on HttpAccessLog (the name itself) and then selected the Log
  stream in the table (which had the same ID as the EC2 instance).
- I saw log data consisting of GET requests, including information about
  the computer and browser that made the request. I expanded lines using
  the arrow to view additional information.
- I found a line with my /start request with a code of 404, indicating
  the page was not found.
This demonstrated how log files can be automatically shipped from an EC2
instance or an on-premises server to CloudWatch Logs, making log data
accessible without having to log in to each individual server. Log data
can also be collected from multiple servers, such as an Auto Scaling
fleet of web servers.
Creating a Metric Filter in CloudWatch Logs
I configured a filter to identify 404 Errors in the log file, which
would normally indicate that the web server is generating invalid links
that users are choosing:
- In Log groups, I selected the checkbox next to HttpAccessLog.
- From the Actions dropdown menu, I selected Create metric filter.
- I entered this filter pattern: [ip, id, user, timestamp, request,
  status_code=404, size]
  This tells CloudWatch Logs how to interpret the fields in the log data
  and defines a filter that matches only lines with status_code=404.
- In the Test pattern section, I used the dropdown menu to select the
  EC2 instance ID (similar to i-0f07ab62aae4xxxx9).
- I clicked Test pattern and then Show test results.
- I confirmed I could see at least one result with a $status_code of
  404.
- I clicked Next and set:
    Filter name: 404Errors
    Metric namespace: LogMetrics
    Metric name: 404Errors
    Metric value: 1
- I clicked Next (clicking an empty text field first if Next wasn't
  enabled).
- On the Review and create page, I clicked Create metric filter.
This metric filter could now be used in an alarm.
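To make the filter pattern more concrete, here is a rough local emulation in Python. It is only a sketch of the idea, not how CloudWatch Logs is implemented: it assumes each log line splits into exactly the seven named fields, with bracketed and quoted groups treated as single fields, and it does not model how CloudWatch handles lines with extra fields. The sample log line and function names are invented for illustration.

```python
import re

FIELD_NAMES = ["ip", "id", "user", "timestamp", "request", "status_code", "size"]

# A field is a [bracketed group], a "quoted group", or a run of non-spaces.
TOKEN = re.compile(r'\[([^\]]*)\]|"([^"]*)"|(\S+)')


def split_fields(line):
    """Split a log line on spaces, keeping [..] and ".." groups whole."""
    return [
        next(group for group in match.groups() if group is not None)
        for match in TOKEN.finditer(line)
    ]


def match_404(line):
    """Return the named fields if the line has status_code 404, else None."""
    values = split_fields(line)
    if len(values) != len(FIELD_NAMES):
        return None  # assumption: only seven-field lines are considered
    fields = dict(zip(FIELD_NAMES, values))
    return fields if fields["status_code"] == "404" else None


def count_404s(lines):
    """Emit a metric value of 1 per matching line and sum the values."""
    return sum(1 for line in lines if match_404(line) is not None)
```

Given a line such as `192.0.2.10 - - [10/Oct/2024:13:55:36 +0000] "GET /start HTTP/1.1" 404 196`, match_404 returns a dict whose request field is "GET /start HTTP/1.1". The same line with status 200 returns None.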
Creating an Alarm Using the Filter
I configured an alarm to notify me when too many 404 errors occur:
- In the 404Errors panel, I selected the checkbox in the top-right
  corner.
- In the Metric filters section, I clicked Create alarm.
- I configured:
    Period: 1 minute
    Conditions: Greater/Equal than 5
- I clicked Next and created a new SNS topic with my email address.
- I clicked Create topic, then Next.
- I set:
    Alarm name: 404 Errors
    Alarm description: Alert when too many 404s detected on an instance
- I clicked Next and Create alarm.
- I confirmed the subscription by clicking the link in the confirmation
  email.
- Back in CloudWatch, the alarm appeared in orange, indicating
  "Insufficient data" because no data had been received in the past
  minute.
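An equivalent alarm could be defined from code. The sketch below assumes boto3 and an existing SNS topic ARN. The Sum statistic and the single evaluation period are my assumptions, since the console steps above only record the period and the threshold.

```python
def build_alarm_request(topic_arn):
    """Keyword arguments for the CloudWatch PutMetricAlarm call."""
    return {
        "AlarmName": "404 Errors",
        "AlarmDescription": "Alert when too many 404s detected on an instance",
        "Namespace": "LogMetrics",
        "MetricName": "404Errors",
        "Statistic": "Sum",        # assumption: total 404s per period
        "Period": 60,              # 1 minute, in seconds
        "EvaluationPeriods": 1,    # assumption: one breaching period is enough
        "Threshold": 5,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }


def create_alarm(topic_arn):
    """Create or update the alarm (needs boto3, credentials, and a region)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**build_alarm_request(topic_arn))
```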
To test the alarm, I:
- Returned to the web server browser tab. (If it was no longer open, I
  reopened it using the WebServerIP from the Details menu.)
- Attempted to go to non-existent pages at least five times by adding
  different page names after the IP address (e.g.,
  http://192.0.2.0/start2).
- Waited 1-2 minutes for the alarm to trigger, refreshing occasionally.
- Confirmed the graph turned red, indicating the Alarm state.
- Checked my email and found an alarm notification with the subject
  "ALARM: 404 Errors".
This demonstrated how to create alarms from application log data and
receive alerts for unusual behavior, with the log file accessible within
CloudWatch Logs for further analysis.
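The test requests that triggered the alarm could also be generated by a short script instead of a browser. This sketch uses only the Python standard library. The function names are hypothetical, and the page names simply follow the /start2 example above.

```python
import urllib.error
import urllib.request


def missing_page_urls(base_url, count=5):
    """Build URLs for pages that are assumed not to exist on the server."""
    base = base_url.rstrip("/")
    return ["{}/start{}".format(base, n) for n in range(1, count + 1)]


def request_status(url):
    """Fetch a URL and return its HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # a missing page is expected to give 404
```

Looping request_status over missing_page_urls("http://192.0.2.0"), with the real WebServerIP substituted for the example address, should produce five 404 entries in the access log.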
Task 3: Monitoring Instance Metrics Using CloudWatch
Metrics are data about the performance of systems. CloudWatch stores
metrics for the AWS services used, and I can also publish my own
application metrics either via the CloudWatch agent or directly from
applications. CloudWatch can present the metrics for search, graphs,
dashboards, and alarms.
I examined EC2 metrics:
- From the Services menu, I chose EC2.
- In the navigation pane, I selected Instances.
- I selected the Web Server and examined the Monitoring tab in the lower
  half of the page.
- I noted that CloudWatch captures metrics about CPU, disk, and network
  usage on the instance, viewing it from the outside as a virtual
  machine.
These metrics don't give insight into what's happening inside the
instance, such as how much memory or disk space is free. Fortunately,
the CloudWatch agent runs inside the instance and collects these
internal metrics.
To view the CloudWatch agent metrics:
- From the Services menu, I selected CloudWatch.
- In the navigation pane, I chose Metrics, then expanded Metrics and
  selected All metrics.
- I saw various metrics in the lower half of the page, some
  automatically generated by AWS and others collected by the CloudWatch
  agent.
- I chose CWAgent, then device, fstype, host, path to see disk space
  metrics.
- I clicked CWAgent above the table (in the line showing All > CWAgent >
  device, fstype, host, path), then chose host to see metrics related to
  system memory.
- I clicked All again and explored other metrics, selecting ones I
  wanted to appear on the graph.
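The agent metrics browsed above can also be read through the CloudWatch API. Here is a minimal sketch, assuming boto3. It queries the mem_used_percent metric in the CWAgent namespace by its host dimension. The host value is a placeholder that has to be replaced with the name shown in the console, and the function names are my own.

```python
from datetime import datetime, timedelta, timezone


def build_memory_query(host, minutes=30, now=None):
    """Keyword arguments for GetMetricStatistics on agent memory usage."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "CWAgent",
        "MetricName": "mem_used_percent",
        "Dimensions": [{"Name": "host", "Value": host}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average"],
    }


def fetch_memory_usage(host):
    """Return (timestamp, average) pairs sorted by time (needs boto3)."""
    import boto3  # imported lazily so the builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    result = cloudwatch.get_metric_statistics(**build_memory_query(host))
    points = sorted(result["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Average"]) for p in points]
```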
Task 4: Creating Real-Time Notifications
CloudWatch Events delivers a near-real-time stream of system events
describing changes in AWS resources. Simple rules can match events and
route them to target functions or streams. CloudWatch Events becomes
aware of operational changes as they occur.
CloudWatch Events can respond to these operational changes and take
corrective action as necessary by sending messages, activating
functions, making changes, and capturing state information. It can also
schedule automated actions using cron or rate expressions.
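To illustrate how a rule matches events, the sketch below applies an event pattern for EC2 instances entering the stopped or terminated state to sample events. The matcher is a deliberately simplified stand-in of my own: it handles only exact-value matching on string fields, while real event patterns support more operators.

```python
# Pattern: EC2 state-change events whose state is stopped or terminated.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}


def matches(pattern, event):
    """Return True if every pattern key matches the event (exact values only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must be one of the listed values.
            return False
    return True


def sample_event(state):
    """Build a trimmed-down state-change event for experimenting."""
    return {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": state},
    }
```

With these definitions, matches(EVENT_PATTERN, sample_event("stopped")) is True and matches(EVENT_PATTERN, sample_event("running")) is False.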
I created a real-time notification for instance state changes:
- In CloudWatch, I expanded Events in the navigation pane and chose
  Rules.
- I clicked Create rule.
- I configured the Event Source:
    Service Name: EC2
    Event Type: EC2 Instance State-change Notification
    Specific state(s): selected the checkbox, then chose stopped and
    terminated from the dropdown menu
- In the Targets section, I:
    - Clicked Add target
    - Selected SNS topic from the dropdown menu (instead of Lambda
      function)
    - For Topic, selected Default_CloudWatch_Alarms_Topic
- I clicked Configure details, named the rule
  "Instance_Stopped_Terminated", and clicked Create rule.
Configuring a Real-Time Notification
I could configure Amazon Simple Notification Service (Amazon SNS) to
send notifications to my phone via SMS or to my email. Since configuring
SMS messaging requires opening a ticket with AWS Support and takes time
to configure, I used email instead.
I noted that more information about configuring SMS messaging with SNS
is available in the Amazon Simple Notification Service Developer Guide.
- From the Services menu, I chose Simple Notification Service.
- In the navigation pane, I chose Topics.
- I clicked the link in the Name column and saw a single subscription
  associated with my email address (the topic I configured in Task 2).
- From the Services menu, I chose EC2.
- In the navigation pane, I chose Instances.
- I selected the Web Server, clicked Instance state, then Stop instance,
  and then Stop.
- The Web Server entered the Stopping state and then the Stopped state
  after a minute.
- I received an email with details about the stopped instance in JSON
  format.
I noted that to receive a more readable message, I could create an AWS
Lambda function triggered by CloudWatch Events. The Lambda function
could format a more readable message and send it via Amazon SNS.
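As a rough idea of what such a Lambda function might look like, here is a sketch I did not deploy as part of this project. It assumes the function is the rule's target, that boto3 is available in the Lambda runtime, and that the topic ARN is supplied through a hypothetical TOPIC_ARN environment variable.

```python
def format_message(event):
    """Turn an EC2 state-change event into a one-line, readable message."""
    detail = event.get("detail", {})
    return "EC2 instance {} in {} is now {} (event time {}).".format(
        detail.get("instance-id", "unknown"),
        event.get("region", "unknown region"),
        detail.get("state", "unknown"),
        event.get("time", "unknown"),
    )


def lambda_handler(event, context):
    """Lambda entry point: publish the formatted message to SNS."""
    import os

    import boto3  # provided by the Lambda Python runtime

    boto3.client("sns").publish(
        TopicArn=os.environ["TOPIC_ARN"],  # hypothetical variable name
        Subject="EC2 instance state change",
        Message=format_message(event),
    )
```

Keeping the formatting in its own function means it can be tried locally with a sample event, without any AWS access.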
This demonstrated how to receive real-time notifications when
infrastructure changes.
Task 5: Monitoring for Infrastructure Compliance
With AWS Config, I can assess, audit, and evaluate the configurations of
AWS resources. AWS Config continuously monitors and records AWS resource
configurations and allows automated evaluation of recorded
configurations against desired configurations.
AWS Config lets me review changes in configurations and relationships
between AWS resources, dive into detailed resource configuration
histories, and determine overall compliance against configurations
specified in internal guidelines. It simplifies compliance auditing,
security analysis, change management, and operational troubleshooting.
I set up AWS Config rules:
- From the Services menu, I chose Config.
- If a Get started button appeared, I completed initial setup by
  clicking Get started, Next, Next, and then Confirm. This configured
  AWS Config for initial use, and I closed the Welcome window.
- In the navigation pane, I chose Rules (toward the top).
- I clicked Add rule and searched for "required-tags" in the AWS Managed
  Rules section.
- I selected required-tags and clicked Next.
- Under Parameters, I set tag1Key to "project" (replacing any existing
  value).
- I clicked Next and Add rule.
This rule looks for resources without a project tag. It takes a few
minutes to complete, so I continued with the next steps.
I then added a rule to check for unused EBS volumes:
- I clicked Add rule and searched for "ec2-volume-inuse-check".
- I selected it, clicked Next twice, and Add rule.
- I waited for at least one rule to complete evaluation, refreshing if
  needed.
- If I saw "No resources in scope," I waited longer as AWS Config was
  still scanning resources.
- I examined each rule by selecting "Compliant" under Resources in
  scope.
The results showed:
- required-tags: A compliant EC2 instance (Web Server with project tag)
  and many non-compliant resources without project tags
- ec2-volume-inuse-check: One compliant volume (attached to an instance)
  and one non-compliant volume (not attached)
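The two managed rules can also be added through the AWS Config API. The following is a minimal sketch, assuming boto3 and that AWS Config recording is already set up. The helper names are hypothetical. REQUIRED_TAGS and EC2_VOLUME_INUSE_CHECK are the managed rule identifiers as I know them, and they should be verified against the AWS Config managed rules list.

```python
import json


def build_managed_rule(name, source_identifier, input_parameters=None):
    """Build the ConfigRule structure for an AWS managed rule."""
    rule = {
        "ConfigRuleName": name,
        "Source": {"Owner": "AWS", "SourceIdentifier": source_identifier},
    }
    if input_parameters:
        # Rule parameters are passed as a JSON-encoded string.
        rule["InputParameters"] = json.dumps(input_parameters)
    return rule


RULES = [
    build_managed_rule("required-tags", "REQUIRED_TAGS", {"tag1Key": "project"}),
    build_managed_rule("ec2-volume-inuse-check", "EC2_VOLUME_INUSE_CHECK"),
]


def add_rules():
    """Register both rules (needs boto3, credentials, and a region)."""
    import boto3  # imported lazily so the builders work without boto3

    config = boto3.client("config")
    for rule in RULES:
        config.put_config_rule(ConfigRule=rule)
```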
I learned that AWS Config has a large library of pre-defined compliance
checks, and custom checks can be created using Lambda.
In conclusion, this work gave me hands-on experience with AWS monitoring
tools that are essential for maintaining reliable systems. I learned how
to collect and analyze logs, track metrics both from outside and inside
instances, receive real-time notifications of infrastructure changes,
and ensure compliance with organizational standards.